INTRODUCTION


The enclosed annual progress reports and publications describe 
and document the research performed by us with the support of NASA 
contract NASW 3317. This contract extended over a period character- 
ized by intense activity and startling discoveries in the interrelated 
areas of molecular biology, genetics, and evolutionary studies of 
prokaryotes, eukaryotes, and their viruses. 


PROPOSAL AND PROGRESS REPORT


INVESTIGATION OF COMPOUNDS 
ESSENTIAL FOR THE ORIGIN OF LIFE 
NASW 3317 

National Biomedical Research Foundation 
Georgetown University Medical Center
3900 Reservoir Road 
Washington, D.C. 20007 

August 28, 1980 






Principal Investigator Margaret O. Dayhoff, Ph.D.
Co-principal Investigator Robert M. Schwartz, Ph.D. 
I. THREE-YEAR GOALS

A. Prokaryote evolution and the emergence of major metabolic pathways 

B. Evolutionary inferences from nucleic acid sequence data 

C. Protein data collection 

II. PROGRESS REPORT 1/1/78 to 8/25/80

NASA contracts NASW 3130, 3259, and 3317 

A. Introduction 

B. List of publications 

C. Nucleic acid sequence reference data collection 

1. Editorial and response published in Nature 

2. Letter in press: Science

3. Computer terminal display for our Demonstration System 

4. Table of contents for our nucleic acid sequence reference data base 
5. Sample pages from our data base

D. Published papers

E. New entries and their protein superfamilies (current protein collection,
May 1980)

F. Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, ed. M.O.
Dayhoff, National Biomedical Research Foundation, Washington, D.C., 1979,
414 pp.

10 copies of this book have been submitted separately 

G. Protein Segment Dictionary 78, M.O. Dayhoff, L.T. Hunt, W.C. Barker, R.M.
Schwartz, and B.C. Orcutt, National Biomedical Research Foundation, 
Washington, D.C., 1978, 470 pp. 

10 copies of this book have been submitted separately 

III. BIOGRAPHICAL SKETCHES 

IV. PROPOSED BUDGET 9/1/81 - 8/31/82 


II. PROGRESS REPORT 1/1/78 - 8/25/80


Introduction 

In early 1978, we published an article in Science that synthesized all of 
the available sequence data pertinent to bacterial and blue-green evolution 
and to the origin of eukaryote organelles. The articles in press from the 6th
International Conference on the Origins of Life together with that in press
from the International Colloquium on Endosymbiosis and Cell Research (see
Section II-D) update that study. They suggest that the origins of the eukaryote
organelles, the mitochondria and the chloroplasts, were not only endosymbiotic
but also polyphyletic, i.e., organelles in different lines of descent arose 
from different bacteria or blue-greens. 

The paper submitted to Nature with Dr. Barnabas initiated a new line of 
interest for us. In that work we correlated the metabolic capabilities of 
bacterial groups for which sequence data are available with their evolutionary 
position based on the sequence data. From this, we begin to infer the order in 
which a variety of metabolic pathways developed during the Precambrian. These 
metabolic capabilities include fermentation, anaerobic respiration, bacterial 
anoxygenic photosynthesis, sulfate reduction, aerobic respiration, and 
oxygenic photosynthesis. 

Recent breakthroughs in DNA and RNA sequencing techniques have greatly speeded
the elucidation of these sequence data. Much of the new data is a natural 
adjunct to our protein data collection, particularly the sequences of complete 
genomes, genes, and messenger RNAs. Other sequences, although less direct in 
their connection, are still extremely important, for example, control signals, 
ribosomal-binding sites, and origins of replication. During the last year, we 
have developed a computerized nucleic acid sequence data base and programs for 
data entry and retrieval as a demonstration project. Our demonstration project 
has as its goal showing what is necessary to make this new detailed genetic 
data intellectually accessible. In the first section of this report, we have 
included items describing the current state of our data base: 

1. An editorial that appeared in Nature pointing out the need for such a
data base and our reply to that editorial.


2. A letter to be published in Science announcing the public availability 
of our data base. 


3. Computer terminal display for our demonstration system.


4. A table of contents of the data base as it currently stands as well
as a sample of the data entries.
IF ANY MEMBERS OF THE GROUP REVIEWING THIS PROPOSAL WOULD LIKE ACCESS TO
OUR NUCLEIC ACID SEQUENCE REFERENCE DATA BASE, THEY MAY CALL EITHER DR.
DAYHOFF OR DR. SCHWARTZ AT (202) 525-2121. 


Clearly, making the data accessible is only the first step in the 
research process. Our NASA contract has supported that portion of this data 
collection bearing on origin of life studies. Additionally, we have requested
supplemental funds to support one senior staff member during the four months 
the demonstration project will be on line in order to help update the 
retrieval system and modify our programs in response to user needs. 

We have continued to maintain a reference data collection of protein
sequences. Our NASA contract supports that part of the data collection and
analysis that is of interest to the study of the origin and early evolution of
life. In 1979, we published supplement 3 to volume 5 of the Atlas of Protein
Sequence and Structure and a Protein Segment Dictionary (both submitted 
separately). We are currently working toward the publication of volume 6 of
the Atlas at the end of 1982. This will be a comprehensive book including new 
data as well as combining and updating the information in volume 5 and its 
three supplements. A list of the new protein data arranged hierarchically by 
evolutionary relationship is shown in Section II-E. 


List of Publications 1/1/78 - 8/25/80 


Books Published:

Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, ed. M.O.
Dayhoff, National Biomedical Research Foundation, Washington, D.C., 1978,
414 pp.

Protein Segment Dictionary 78, M.O. Dayhoff, L.T. Hunt, W.C. Barker, R.M.
Schwartz and B.C. Orcutt, National Biomedical Research Foundation,
Washington, D.C., 1978, 470 pp.


Other Output:

Protein Sequence Data Tape, Atlas of Protein Sequence and Structure, M.O.
Dayhoff, L.T. Hunt, W.C. Barker and R.M. Schwartz, National Biomedical
Research Foundation, Washington, D.C., 1978. [119,006 residues from 1,081
sequences]


Papers Published: 

An outline of biological evolution based on macromolecular sequences. R.M. 
Schwartz, M.O. Dayhoff. COMPARATIVE PLANETOLOGY, ed. by C. Ponnamperuma,
pp. 225-242. Academic Press, N.Y., 1978.

Origins of prokaryotes, eukaryotes, mitochondria, and chloroplasts. R.M. 
Schwartz and M.O. Dayhoff, Science 199: 395-403, January 27, 1978.

The point mutation process in proteins. R.M. Schwartz and M.O. Dayhoff, 
in: Origin of Life: Proceedings of the Second ISSOL Meeting, the Fifth
ICOL Meeting, Haruhiko Noda, editor, Center for Academic Publications 
Japan/Japan Scientific Societies Press, 1978, pp. 457-469. 

Evolution of early life inferred from protein and ribonucleic acid 
sequences. M.O. Dayhoff and R.M. Schwartz, in: Origin of Life: Proceedings 
of the Second ISSOL Meeting, the Fifth ICOL Meeting, Haruhiko Noda, editor,
Center for Academic Publications Japan/Japan Scientific Societies Press, 
1978, pp. 547-560. 

Detection of distant relationships based on point mutation data. R.M. 
Schwartz and M.O. Dayhoff, Evolution of Protein Molecules, ed. H.
Matsubara and T. Yamanaka, pp. 1-16. Center for Academic Publications
Japan/Japan Scientific Societies Press, Tokyo, 1978. 

Evolution of prokaryotes inferred from sequences. M.O. Dayhoff and R.M. 
Schwartz, Evolution of Protein Molecules, ed. H. Matsubara and T. Yamanaka, 
pp. 323-42. Center for Academic Publications Japan/Japan Scientific 
Societies Press, Tokyo, 1978.

Protein and nucleic acid sequence data and phylogeny. R.M. Schwartz and 
M.O. Dayhoff. Science, 205 (4410): 1036-39, 7 Sept. 1979. [Exchange of
Technical Comments with Vincent Demoulin]


Evolutionary relationships among photosynthetic prokaryotes inferred from 
protein and nucleic acid sequence data. R.M. Schwartz and M.O. Dayhoff.
Third International Symposium on Photosynthetic Prokaryotes, Oxford, 1979. 
Abstracts. E7 

Prokaryote evolution and the symbiotic origin of eukaryotes. M.O. Dayhoff
and R.M. Schwartz. Proceedings of the International Colloquium on
Endosymbiosis and Cell Research, April 11-15, 1980, Tübingen, Germany.
Berlin: Walter de Gruyter & Co., 1980. In press.

Phylogenetic sequence of metabolic pathways in Precambrian cellular life.
J. Barnabas, R.M. Schwartz, and M.O. Dayhoff. Proceedings of the 6th
International Conference on the Origins of Life. Dordrecht, The
Netherlands: Reidel, 1980. In press.

The evolution of blue-greens and the origins of chloroplasts. R.M.
Schwartz and M.O. Dayhoff. Proceedings of the 6th International Conference
on the Origins of Life. Dordrecht, The Netherlands: Reidel, 1980. In
press.

Evolution of the Rhodospirillaceae and mitochondria: a view based on
sequence data. M.O. Dayhoff and R.M. Schwartz. Proceedings of the 6th
International Conference on the Origins of Life. Dordrecht, The
Netherlands: Reidel, 1980. In press.


Paper submitted for publication:

Evolution of major metabolic innovations in the Precambrian. J. Barnabas,
R.M. Schwartz, and M.O. Dayhoff. Submitted to Nature, June 1980.
New Entries and Their
Protein Superfamilies
Update of May 1980

M.O. Dayhoff, H.R. Chen, B.C. Orcutt, W.C. Barker, L.T. Hunt,
and R.M. Schwartz

NBR Report 08710-800515 

National Biomedical Research Foundation 
Georgetown University Medical Center 
3900 Reservoir Road, N.W.
Washington, D.C. 20007


CONTENTS 


1. Superfamily list from the Atlas, Suppl. 3, containing the complete
sequences.


2. Explanation of computer listings of new entries. 


3. Computer listing of new, complete or almost complete sequences with 
their superfamily classification. 

4. Alphabetical listing of other new entries.


2 Protein Superfamilies


M.O. Dayhoff, W.C. Barker, L.T. Hunt, and R.M. Schwartz


In the list that follows, we have organized all of the
complete sequences reported in the Atlas volumes into
groups of superfamilies, families, subfamilies, and entries.
The number in each group, the criteria for clustering, and
the method of identification of the hierarchical levels in
the list are shown below.


Number of                   Criteria for Clustering      Identification
Groups      Group           Sequences                    of Cluster

181         Superfamilies   Probability of similarity    Number
                            by chance <10⁻⁶
314         Families        <50% different               Letter
537         Subfamilies     <20% different               Paragraph
793         Atlas entries   <5% different                Semicolon

Sequences in the same entry are separated by commas.


This list updates the one that appeared in Supplement 2,
in which there were 116 superfamilies, 197 families, 328
subfamilies, and 493 entries. There has been about a 60%
increase in all categories in the intervening 2 years and 7
months.
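The "about 60%" figure can be checked directly from the counts quoted for Supplement 2 and for the present list. The short Python sketch below is only an arithmetic illustration; the dictionary and function names are our own labels, not part of the Atlas.

```python
# Counts quoted in the text: the Supplement 2 list versus the present list.
SUPPLEMENT_2 = {"superfamilies": 116, "families": 197, "subfamilies": 328, "entries": 493}
PRESENT_LIST = {"superfamilies": 181, "families": 314, "subfamilies": 537, "entries": 793}


def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


growth = {
    name: percent_increase(SUPPLEMENT_2[name], PRESENT_LIST[name])
    for name in SUPPLEMENT_2
}

for name, pct in growth.items():
    print(f"{name:14s} {pct:5.1f}%")
```

Each category comes out between roughly 56% and 64%, which is consistent with the rounded statement in the text.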

Only complete or nearly complete sequences that are
20 or more residues in length are included. The constant 
and variable regions of immunoglobulins are counted as 
separate sequences. Sequences that can be considered 
complete in one sense but partial in another are generally 
included. Examples are active hormone and enzyme se- 
quences that are derived from longer precursors, and se- 
quences of entire homology regions from proteins with 
two or more such regions. 

Proteins within a family usually differ at fewer than 
half of their amino acid positions and they are either 
homologs in various species or products of gene duplica- 
tion; their similarity of function has usually been recog- 
nized before the sequences were known and they have 
identical or very similar names. Families are identified by 
letters in the list.

The sequences within a family have been divided into 
subfamilies, which are shown as paragraphs. Sequences 
within a subfamily usually differ from each other at fewer 
than 20% of their amino acid positions. Within a subfamily,
sequences that differ by less than 5% and form a single
Atlas entry are separated by commas, whereas sequences
or groups that are more than 5% different are separated
by semicolons.
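The percentage cutoffs described above can be summarized in a few lines of Python. This is a minimal sketch, assuming the two sequences have already been aligned to equal length with no gaps; the function names and the two toy sequences are hypothetical and are not taken from the Atlas.

```python
def percent_difference(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 100.0 * mismatches / len(seq_a)


def closest_shared_level(pct_diff: float) -> str:
    """Map a percent difference onto the narrowest grouping that would hold both sequences."""
    if pct_diff < 5:
        return "entry"        # sequences within one Atlas entry, separated by commas
    if pct_diff < 20:
        return "subfamily"    # shown as a paragraph in the list
    if pct_diff < 50:
        return "family"       # identified by a letter
    return "superfamily or unrelated (decided by the statistical test)"


# Hypothetical 20-residue sequences differing at 3 positions (15%).
toy_a = "MKTAYIAKQRQISFVKSHFS"
toy_b = "LKTAYVAKQREISFVKSHFS"
level = closest_shared_level(percent_difference(toy_a, toy_b))
print(level)
```

As the text notes, real cases near a cutoff are borderline, so a hard threshold like this one is only a first pass before the evolutionary tree is consulted.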

In a clustering procedure such as this there will always 
be cases that are borderline, some pairs within a group 
being below the cutoff and some above. Where possible,
we have grouped together proteins that fall on the same 
branch of an evolutionary tree. 

The families are grouped into superfamilies, identified
by numbers, where similarity of sequences in
different families can be recognized by statistical proce- 
dures. We have used two such methods to compare pairs 
of complete sequences; for sequences of comparable 
length we used a method based on the best alignment of
the two sequences; for sequences of quite different length, 
we used a method based on the distribution of scores 
obtained on comparison of all segments of a given length 
from one sequence with those from the other. These 
methods are described in detail in chapter 1. Each method
produces a probability that the scores from the compari- 
son of two real sequences could have been derived from 
the distribution of scores produced by comparisons of 
pairs of randomly permuted sequences with the same 
amino acid compositions as the two real sequences. 
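The idea of comparing a real score with scores from randomly permuted sequences of the same composition can be sketched as follows. This is an illustration only: it assumes a simple ungapped identity count as the score, whereas the methods described in chapter 1 use best alignments and segment comparisons. All function names here are hypothetical.

```python
import random
import statistics


def identity_score(seq_a: str, seq_b: str) -> int:
    """Count identical residues at matching positions (ungapped; a stand-in score)."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


def shuffled(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq, preserving its amino acid composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def permutation_test(seq_a: str, seq_b: str, trials: int = 1000, seed: int = 0):
    """Compare the real score with scores from pairs of permuted sequences."""
    rng = random.Random(seed)
    real = identity_score(seq_a, seq_b)
    random_scores = [
        identity_score(shuffled(seq_a, rng), shuffled(seq_b, rng))
        for _ in range(trials)
    ]
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    # Standard deviations above the random mean; undefined if all random scores agree.
    z = (real - mean) / sd if sd > 0 else float("nan")
    # Empirical estimate of the chance of a random score at least as high as the real one.
    exceed = sum(1 for s in random_scores if s >= real)
    return real, z, (exceed + 1) / (trials + 1)


toy = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical sequence, compared with itself
real_score, z_score, p_estimate = permutation_test(toy, toy)
print(real_score, round(z_score, 1), p_estimate)
```

An empirical count of this kind cannot reach probabilities as small as those used for superfamily assignment with a feasible number of trials, which is why a fitted distribution of the random scores is needed in practice.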

A newly determined sequence is placed in an existing 
superfamily if, on comparison with the best conserved 
sequence from each family in that superfamily, at least 
one probability of <10⁻⁶ is obtained. For a collection of
314 families that might potentially be combined,
(314 × 313)/2 = 49,141 comparisons are possible. The
probability of finding a score of 10⁻⁶ by chance in one or
more of these is 5%. Thus, we have 95% confidence that
all of the families that have been grouped together really
share significant sequence similarity.
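The 5% figure follows from the number of family pairs and the per-comparison cutoff. The Python sketch below reproduces the arithmetic; treating the comparisons as independent in the second calculation is our assumption and is not stated in the text.

```python
from math import comb

n_families = 314
per_comparison_cutoff = 1e-6

# Number of distinct family pairs: (314 x 313) / 2.
n_pairs = comb(n_families, 2)

# Expected number of chance hits across all pairs (the simple product used in the text).
expected_chance_hits = n_pairs * per_comparison_cutoff

# Probability of at least one chance hit, assuming independent comparisons.
p_at_least_one = 1.0 - (1.0 - per_comparison_cutoff) ** n_pairs

print(n_pairs, expected_chance_hits, p_at_least_one)
```

Both values are close to 0.05, so the simple product is a good approximation at this scale.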

The ultimate superfamily list could be derived from 
sequence information alone, provided that at least one
sequence was known from most subfamilies within each 
family. At present we do not have this much sequence 
information, but often we have information on chemical 
or physiological functions that reflect relationship. Where
we know in advance that several proteins share a similar 
function, we have required that the probability for a single
comparison within the group be <10⁻³ in order to
cluster the sequences in the same superfamily. 


10 ATLAS OF PROTEIN SEQUENCE AND STRUCTURE 1978 




It is also possible to establish relationships on the basis
of search scores (see chapter 1). There are approximately
10⁵ 20-residue segments in the data collection. If a segment
of 20 residues is compared with all of the 10⁵ other
segments in the collection, an approximately normal
distribution of scores is obtained. From the mean and standard
deviation of this distribution, the probability of finding
a score equal to or greater than any given score can be
calculated. In principle, 10⁵ such searches, one for each
20-residue segment, could be performed, leading to the
accumulation of 10⁵ × 10⁵ = 10¹⁰ probabilities. If the
probability associated with a given score is ≤0.5 × 10⁻¹¹,
there is a probability of approximately 0.5 × 10⁻¹¹ ×
10¹⁰ = 0.05 of finding one such score by chance in an
exhaustive intercomparison of all segments. The probability
of 0.5 × 10⁻¹¹ calculated from the normal distribution
corresponds to a confidence level of 95% that the sequence
similarities discovered by search scores are unusual
enough to indicate relationship. We feel that very
low probabilities are a reflection of the common evolutionary
origin of the proteins. Other similarities of structure,
function, and control would therefore be predicted.
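The search-score argument can be traced numerically. The Python sketch below repeats the product quoted in the text and, as an added illustration, finds how many standard deviations above the mean a score must lie to reach a tail probability of 0.5 × 10⁻¹¹ under an exactly normal distribution. The helper names are our own, and the exact-normality assumption is stronger than the "approximately normal" statement in the text.

```python
import math


def upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def z_for_tail(p: float, lo: float = 0.0, hi: float = 40.0) -> float:
    """Find z with upper_tail(z) == p by bisection (upper_tail is decreasing in z)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if upper_tail(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


n_segments = 10**5                        # approximate number of 20-residue segments
n_comparisons = n_segments * n_segments   # one search per segment against all segments
per_score_cutoff = 0.5e-11

expected_chance_hits = per_score_cutoff * n_comparisons
z_cutoff = z_for_tail(per_score_cutoff)

print(n_comparisons, expected_chance_hits, round(z_cutoff, 2))
```

Under these assumptions the cutoff lies a little under seven standard deviations above the mean search score.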

Superfamily Groups 

Most of the family relationships in this list were
pointed out in the papers describing the sequence work 
and are referenced in the data pages. Quantitative data on 
relationships are given in chapter 10 of the Atlas, Volume 
5, in the Survey of New Material of Supplement 1, and
in many tables in Supplement 2, as well as in this book.
We have applied these quantitative criteria for defining 
relationship to the suggestions of others and to the hope- 
ful leads that we have turned up by extensive searching of 
the data in organizing this list. 

We have grouped together several proteins of similar 
function that get borderline probabilities of sequence 
similarity, including pancreatic hormone from chicken 
with glucagon and secretin, antibacterial substance A and 
neocarzinostatin from Streptomyces, ferredoxins with
adrenodoxin and putidaredoxin, the fungal with the 
bacterial ribonucleases, and peanut protease inhibitor and 
bromelain inhibitor with the Bowman-Birk type protease 
inhibitors. In other instances we have chosen not to combine
borderline cases. The four histones would be combined
on the basis of comparisons using the identity
matrix but would not be combined using the mutation 
data matrix. There are a number of short sequences that 
are repeated in at least two histone groups. However, 
there have been many insertions and deletions as well as 
point mutations, so we have left the four groups as sepa- 
rate superfamilies. Human epidermal growth factor (EGF) 
and a small part of bovine factor X are clearly related. We
suspect that EGF may even be a degradation product of 


an as yet unsequenced serine protease. Because of its distinct
function, we have left EGF as a distinct superfamily
until the situation is clarified. Bird apovitellenins and the
human lipid-binding proteins have been left in separate
superfamilies. Additional groups of protease inhibitors
may eventually be combined when more sequences are
known.

Two groups with similar functions and three-dimen- 
sional structures do not display significant sequence 
similarity. The dehydrogenases (alcohol, lactate, glutamate, 
and glyceraldehyde 3-phosphate) are separate super- 
families, as are the constant and variable regions of the 
immunoglobulins. Presumably in both of these cases there 
have been too many insertions, deletions, and point mutations
to deduce a common evolutionary origin from the
sequences. 

Relationships among some of the superfamilies may 
eventually be demonstrated as more extensive sequence 
information becomes available for each family, permitting 
the construction of ancestral sequences for which the
mutability of each residue can be estimated. Additional
information on relationships may be derived from the
similarity of amino acid compositions.

A further organization of superfamilies reflecting 
common evolutionary origin may be possible based on 
additional nonsequence information: for example, the 
three-dimensional structures or the positions in a 
metabolic pathway. In this list, the superfamilies are 
grouped according to function or prosthetic group. 

It has been estimated that in humans there are approxi- 
mately 50,000 proteins of functional or medical impor- 
tance. We conjecture that these will be grouped into about 
500 superfamilies, each containing an average of 100 
sequences that range from minor variants up to 85% 
or 90% different from one another. A similar number of
superfamilies has been proposed by Zuckerkandl. A
landmark of molecular biology will occur when one mem- 
ber of each superfamily has been elucidated. At the 
present rate of 25 per year, this will take less than 15 
years. 
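The projection can be checked against the counts given earlier in this chapter. The Python sketch below assumes that the "present rate of 25 per year" refers to newly recognized superfamilies, which is our reading; it agrees with the growth from 116 to 181 superfamilies over 2 years and 7 months. The variable names are our own.

```python
previous_superfamilies = 116      # Supplement 2 list
current_superfamilies = 181       # present list
interval_years = 2 + 7 / 12       # 2 years and 7 months

# Observed rate of newly recognized superfamilies per year.
observed_rate = (current_superfamilies - previous_superfamilies) / interval_years

# Years needed to reach the conjectured total at the quoted rate of 25 per year.
conjectured_total = 500
quoted_rate = 25
years_needed = (conjectured_total - current_superfamilies) / quoted_rate

print(round(observed_rate, 1), years_needed)
```

The observed rate is close to 25 per year, and the remaining 319 superfamilies would take a little under 13 years at that rate, within the "less than 15 years" stated in the text.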

Explanation of Computer Listings of New Entries

We have examined the relationships between all of the new sequences and the
ones already in the collection. Each sequence in the Suppl. 3 superfamily
list has been assigned five numbers, according to its superfamily number, 
its position among the families of the superfamily, its position among the 
subfamilies of its family (paragraphs), its position among the entries of 
its subfamily (strings separated by semicolons), and its position in the 
string of sequences in an entry. Each new item has been assigned five
numbers that place it between two other entries in the list, where it
belongs. We show the first four of these numbers on the updated
superfamily list. Thus the first sequence on the list, Cytochrome c-Rice,
belongs in the first superfamily and the first family of cytochrome c
related proteins. It is in the l^th subfamily, in between the third and 
fourth entries, sesame and castor. Similarly, the C-phycocyanin alpha 
chains (No. 5.2) form a new superfamily coming between cytochrome b562
(No. 5) and ferredoxin (No. 6). Before publication the entire list can be
renumbered using only integers, and a superfamily list similar to the one
already published can be printed out by the computer. 
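The numbering scheme described above amounts to sorting entries by a tuple of position numbers, using a non-integer value to slot a new item between two existing ones, and then replacing the values by integer ranks. The Python sketch below illustrates this with the four numbers shown on the list. It is not the program used at NBRF; apart from the superfamily numbers 5, 5.2, and 6 quoted in the text, the keys and labels are hypothetical.

```python
def renumber(keys):
    """Replace each key component by its 1-based rank among siblings with the same prefix."""
    ranks = {}    # prefix of original key -> {original value: integer rank}
    result = {}
    for key in sorted(keys):
        new_key = []
        for depth, value in enumerate(key):
            siblings = ranks.setdefault(key[:depth], {})
            if value not in siblings:
                # Keys are visited in sorted order, so first appearance gives the rank.
                siblings[value] = len(siblings) + 1
            new_key.append(siblings[value])
        result[key] = tuple(new_key)
    return result


# (superfamily, family, subfamily, entry) -> label; a toy excerpt, not the real list.
entries = {
    (1, 1, 1, 3): "existing entry A",
    (1, 1, 1, 4): "existing entry B",
    (1, 1, 1, 3.5): "new entry placed between A and B",
    (5, 1, 1, 1): "existing superfamily No. 5",
    (6, 1, 1, 1): "ferredoxin",
    (5.2, 1, 1, 1): "C-phycocyanin alpha chains (new superfamily)",
}

ordered_labels = [entries[key] for key in sorted(entries)]
integer_keys = renumber(entries)

for key in sorted(entries):
    print(integer_keys[key], entries[key])
```

Because this toy excerpt omits most of the list, the integer ranks it prints are not the numbers the published list would carry; only the relative order is meaningful.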

Some of the new entries contain short sequences or fragments of longer 
sequences and have not been assigned superfamily numbers. These are listed 
separately in alphabetical order. 

The two computer listings contain 396 items. Of these, 358 are 
totally new entries, whereas 38 are revisions to published Atlas entries, 
usually the completion of a sequence for which only fragmentary information
was formerly known. 


New, complete or almost complete sequences of 20 or more residues with their superfamily classifications 